Take-home Exercise 4: Prototyping Time Series Module for Visual Analytics Shiny Application

Author

Goh Si Hui

Published

February 25, 2024

Modified

March 14, 2024

1 About this Exercise

In this exercise, we will be preparing the time-series forecasting module of the proposed Shiny Application and complete the following task:

  • Evaluate and determine the necessary R packages needed for my group project’s Shiny application;

  • Prepare and test the specific R codes can be run and returned the correct output as expected;

  • Determine the parameters and outputs that will be exposed on the Shiny applications; and

  • Select the appropriate Shiny UI components for exposing the parameters determined above.

2 Getting Started

Before we start, let us ensure that the required R packages have been installed and import the relevant data for this exercise.

2.1 Loading R Packages

For this exercise, we will be using the following packages:

  • tidyverse

  • xts

  • lubridate

  • tsbox

  • imputeTS

  • DT

  • ggplot2

  • plotly

  • ggthemes

  • hrbrthemes

  • forecast

  • MLmetrics

The code chunk below uses p_load() of pacman package to check if the packages are installed in the computer. If they are, they will be launched in R. Otherwise, pacman will install the relevant packages before launching them.

Show the code
pacman::p_load(tidyverse, lubridate, knitr, DT, ggplot2, plotly, ggthemes, ggfortify, forecast, MLmetrics, tsbox, xts, imputeTS, tseries, hrbrthemes, autoplotly)

2.2 Importing Data Into R

For this exercise, we will be using the historical daily weather records from weather.gov.sg. We retrieved the daily records data from Jan 1980 to Dec 2023 via data.gov.sg’s API. The daily historical weather records is in csv file format.

We use read_csv() function of readr to import the daily_historical csv file into R then we will use glimpse() of dplyr to learn about the associated attribute information in the dataframe.

Show the code
data <- read_csv("data/daily_historical.csv")
glimpse(data)
Observations
  • There is no date but there are columns year, month and day. In addition, we also note that R currently read columns year, month and day as numeric data.

  • There are different columns for rainfall and temperature so we will need to select and filter the relevant columns that we want in subsequent steps.

  • The entire dataset daily_historical.csv is very large. We should save the filtered data into an R data format (RDS) so that we can easily retrieve it in future without importing the entire daily_historical.csv dataset again.

2.3 Creating a date column

Let us first create a date column, called tdate using paste(), mutate() and lubridate’s ymd() to convert the numeric data type into a date data type and year-month-day format.

Show the code
data$tdate <- paste(data$year, "-", data$month, "-", data$day)
data <- data %>%
  mutate(tdate = ymd(tdate))

glimpse(data)

3 Temperature Data

Now we will select the relevant temperature related columns needed for this exerise using select(). We will examine the resultant dataframe temp using str().

Show the code
temp <- data %>%
  select(station, tdate, mean_temperature, maximum_temperature, minimum_temperature) 

str(temp)
Observations
  • The resultant dataframe (temp) is a tibble dataframe with the following columns:
  • station: refers to the weather station that collected this temperature reading.
  • tdate: refers to the date of the temperature reading collected.
  • mean_temperature: refers to the mean temperature reading of that day.
  • maximum_temperature: refers to the highest temperature reading of that day.
  • minimum_temperature: refers to the lowest temperature reading of that day.

Let us save the dataframe into an RDS file for future usage using write_rds().

Show the code
write_rds(temp, "data/temperature.rds")

We will bring in the temperature data using read_rds(). Let us check the imported RDS data using str() again.

Show the code
temp <- read_rds("data/temperature.rds")
str(temp)
tibble [329,156 × 5] (S3: tbl_df/tbl/data.frame)
 $ station            : chr [1:329156] "Macritchie Reservoir" "Macritchie Reservoir" "Macritchie Reservoir" "Macritchie Reservoir" ...
 $ tdate              : Date[1:329156], format: "1980-01-01" "1980-01-02" ...
 $ mean_temperature   : num [1:329156] NA NA NA NA NA NA NA NA NA NA ...
 $ maximum_temperature: num [1:329156] NA NA NA NA NA NA NA NA NA NA ...
 $ minimum_temperature: num [1:329156] NA NA NA NA NA NA NA NA NA NA ...
Observations
  • temp is a tibble dataframe.
  • The data type for the columns are in order with station being treated as character data type, tdate as date data type and the temperature related columns are seen as numeric.
  • There seems to be a large number of missing temperature. Let us investigate further in the subsequent steps.

3.1 Investigating missing values

First, let us use summary to have a sense of the missing data.

Show the code
summary(temp)
   station              tdate            mean_temperature maximum_temperature
 Length:329156      Min.   :1980-01-01   Min.   :22.20    Min.   :22.80      
 Class :character   1st Qu.:1997-04-29   1st Qu.:27.10    1st Qu.:30.50      
 Mode  :character   Median :2011-09-18   Median :27.90    Median :31.70      
                    Mean   :2007-07-02   Mean   :27.87    Mean   :31.49      
                    3rd Qu.:2017-11-27   3rd Qu.:28.80    3rd Qu.:32.60      
                    Max.   :2023-12-31   Max.   :31.50    Max.   :38.00      
                    NA's   :58           NA's   :255645   NA's   :255282     
 minimum_temperature
 Min.   :20.0       
 1st Qu.:24.3       
 Median :25.2       
 Mean   :25.3       
 3rd Qu.:26.3       
 Max.   :30.0       
 NA's   :255283     
Observations
  • The observations ranged from 1 Jan 1980 to 31 Dec 2023. There are 58 rows with missing dates. We should drop these rows since they are unable to tell us which day the observations were made (even if they have temperature readings).

  • There are 255,645 rows of NAs in daily mean temperature.

  • There are 255,282 rows of NAs in daily maximum temperature.

  • There are 255,283 rows of NAs in daily minimum temperature.

  • We noted that there are a lot of missing values. As the aim of this task is to forecast future temperatures, missing value treatment is not so straight-forward. Imputation using mean, median & mode might hide trends or seasonal patterns whereas removing missing data points altogether might reduce information contained in other features. Let’s understand more about the missing values before we proceed to do imputation for missing values.

First, let us drop those rows where date is missing because we would not be able to definitively identify when the temperature(s) were collected (even if there were temperature readings for these rows.

Show the code
temp <- temp %>%
  drop_na(tdate)

summary(temp)
   station              tdate            mean_temperature maximum_temperature
 Length:329098      Min.   :1980-01-01   Min.   :22.20    Min.   :22.80      
 Class :character   1st Qu.:1997-04-29   1st Qu.:27.10    1st Qu.:30.50      
 Mode  :character   Median :2011-09-18   Median :27.90    Median :31.70      
                    Mean   :2007-07-02   Mean   :27.87    Mean   :31.49      
                    3rd Qu.:2017-11-27   3rd Qu.:28.80    3rd Qu.:32.60      
                    Max.   :2023-12-31   Max.   :31.50    Max.   :38.00      
                                         NA's   :255587   NA's   :255224     
 minimum_temperature
 Min.   :20.0       
 1st Qu.:24.3       
 Median :25.2       
 Mean   :25.3       
 3rd Qu.:26.3       
 Max.   :30.0       
 NA's   :255225     

3.2 Further investigation of missing temperatures using plotly

We noted that there are many weather stations in the temp dataframe. Hence, we will make use of plotly to further explore the missing temperatures.

3.2.1 Daily Mean Temperatures

Let us first explore the daily mean temperatures by selecting the relevant columns and pivot the dataframe wider.

Show the code
temp_mean_wide <- temp %>%
  select(tdate, station, mean_temperature) %>%
  pivot_wider(names_from = station, values_from = mean_temperature)

summary(temp_mean_wide)
     tdate            Macritchie Reservoir Lower Peirce Reservoir
 Min.   :1980-01-01   Min.   : NA          Min.   : NA           
 1st Qu.:1990-12-31   1st Qu.: NA          1st Qu.: NA           
 Median :2001-12-31   Median : NA          Median : NA           
 Mean   :2001-12-31   Mean   :NaN          Mean   :NaN           
 3rd Qu.:2012-12-30   3rd Qu.: NA          3rd Qu.: NA           
 Max.   :2023-12-31   Max.   : NA          Max.   : NA           
                      NA's   :16071        NA's   :16071         
   Admiralty     East Coast Parkway   Ang Mo Kio        Newton     
 Min.   :22.50   Min.   :23.40      Min.   :22.50   Min.   :22.20  
 1st Qu.:26.80   1st Qu.:27.50      1st Qu.:26.90   1st Qu.:26.80  
 Median :27.60   Median :28.20      Median :27.80   Median :27.70  
 Mean   :27.66   Mean   :28.13      Mean   :27.82   Mean   :27.58  
 3rd Qu.:28.50   3rd Qu.:28.90      3rd Qu.:28.80   3rd Qu.:28.40  
 Max.   :30.80   Max.   :30.80      Max.   :31.20   Max.   :30.60  
 NA's   :10821   NA's   :10806      NA's   :10918   NA's   :11148  
  Lim Chu Kang   Marine Parade   Choa Chu Kang (Central)   Tuas South   
 Min.   : NA     Min.   : NA     Min.   : NA             Min.   :23.10  
 1st Qu.: NA     1st Qu.: NA     1st Qu.: NA             1st Qu.:27.40  
 Median : NA     Median : NA     Median : NA             Median :28.20  
 Mean   :NaN     Mean   :NaN     Mean   :NaN             Mean   :28.19  
 3rd Qu.: NA     3rd Qu.: NA     3rd Qu.: NA             3rd Qu.:29.00  
 Max.   : NA     Max.   : NA     Max.   : NA             Max.   :31.00  
 NA's   :16071   NA's   :16071   NA's   :16071           NA's   :11609  
 Pasir Panjang   Jurong Island   Nicoll Highway  Botanic Garden 
 Min.   :23.20   Min.   :23.40   Min.   : NA     Min.   : NA    
 1st Qu.:27.50   1st Qu.:27.60   1st Qu.: NA     1st Qu.: NA    
 Median :28.30   Median :28.40   Median : NA     Median : NA    
 Mean   :28.25   Mean   :28.29   Mean   :NaN     Mean   :NaN    
 3rd Qu.:29.10   3rd Qu.:29.10   3rd Qu.: NA     3rd Qu.: NA    
 Max.   :31.30   Max.   :31.10   Max.   : NA     Max.   : NA    
 NA's   :11049   NA's   :11748   NA's   :16071   NA's   :16071  
 Choa Chu Kang (South)    Whampoa          Changi       Jurong Pier   
 Min.   :22.70         Min.   : NA     Min.   :22.80   Min.   : NA    
 1st Qu.:26.80         1st Qu.: NA     1st Qu.:26.90   1st Qu.: NA    
 Median :27.70         Median : NA     Median :27.70   Median : NA    
 Mean   :27.68         Mean   :NaN     Mean   :27.69   Mean   :NaN    
 3rd Qu.:28.60         3rd Qu.: NA     3rd Qu.:28.60   3rd Qu.: NA    
 Max.   :31.00         Max.   : NA     Max.   :30.90   Max.   : NA    
 NA's   :11558         NA's   :16071   NA's   :731     NA's   :16071  
   Ulu Pandan        Mandai         Tai Seng     Jurong (West)  
 Min.   : NA     Min.   : NA     Min.   :23.2    Min.   :22.20  
 1st Qu.: NA     1st Qu.: NA     1st Qu.:27.6    1st Qu.:26.60  
 Median : NA     Median : NA     Median :28.4    Median :27.40  
 Mean   :NaN     Mean   :NaN     Mean   :28.4    Mean   :27.41  
 3rd Qu.: NA     3rd Qu.: NA     3rd Qu.:29.3    3rd Qu.:28.30  
 Max.   : NA     Max.   : NA     Max.   :31.5    Max.   :30.60  
 NA's   :16071   NA's   :16071   NA's   :11454   NA's   :10995  
    Clementi     Sentosa Island  Bukit Panjang   Kranji Reservoir
 Min.   :22.80   Min.   :23.00   Min.   : NA     Min.   : NA     
 1st Qu.:26.90   1st Qu.:27.40   1st Qu.: NA     1st Qu.: NA     
 Median :27.70   Median :28.20   Median : NA     Median : NA     
 Mean   :27.63   Mean   :28.12   Mean   :NaN     Mean   :NaN     
 3rd Qu.:28.50   3rd Qu.:28.90   3rd Qu.: NA     3rd Qu.: NA     
 Max.   :30.60   Max.   :31.10   Max.   : NA     Max.   : NA     
 NA's   :11258   NA's   :11317   NA's   :16071   NA's   :16071   
 Upper Peirce Reservoir   Kent Ridge      Queenstown    Tanjong Katong 
 Min.   : NA            Min.   : NA     Min.   : NA     Min.   : NA    
 1st Qu.: NA            1st Qu.: NA     1st Qu.: NA     1st Qu.: NA    
 Median : NA            Median : NA     Median : NA     Median : NA    
 Mean   :NaN            Mean   :NaN     Mean   :NaN     Mean   :NaN    
 3rd Qu.: NA            3rd Qu.: NA     3rd Qu.: NA     3rd Qu.: NA    
 Max.   : NA            Max.   : NA     Max.   : NA     Max.   : NA    
 NA's   :16071          NA's   :16071   NA's   :16071   NA's   :16071  
 Somerset (Road)    Punggol          Simei         Toa Payoh    
 Min.   : NA     Min.   : NA     Min.   : NA     Min.   : NA    
 1st Qu.: NA     1st Qu.: NA     1st Qu.: NA     1st Qu.: NA    
 Median : NA     Median : NA     Median : NA     Median : NA    
 Mean   :NaN     Mean   :NaN     Mean   :NaN     Mean   :NaN    
 3rd Qu.: NA     3rd Qu.: NA     3rd Qu.: NA     3rd Qu.: NA    
 Max.   : NA     Max.   : NA     Max.   : NA     Max.   : NA    
 NA's   :16071   NA's   :16071   NA's   :16071   NA's   :16071  
      Tuas        Bukit Timah    Pasir Ris (Central)
 Min.   : NA     Min.   : NA     Min.   : NA        
 1st Qu.: NA     1st Qu.: NA     1st Qu.: NA        
 Median : NA     Median : NA     Median : NA        
 Mean   :NaN     Mean   :NaN     Mean   :NaN        
 3rd Qu.: NA     3rd Qu.: NA     3rd Qu.: NA        
 Max.   : NA     Max.   : NA     Max.   : NA        
 NA's   :16071   NA's   :16071   NA's   :16071      

We will make use of plotly to explore the missing daily temperatures for each station using a dropdown list.

Show the code
plot_ly(data = temp_mean_wide, 
        x = ~tdate, 
        y = ~ Admiralty, 
        type = "scatter", 
        mode = "lines") |> 
  layout(title = "Temperature observed by Weather Stations", 
       xaxis = list(title = "Date"), 
       yaxis = list(title = "", range = c(0,40)), 
      theme_ipsum_rc(plot_title_size = 13, plot_title_margin=4, subtitle_size=11, subtitle_margin=4,  
                 axis_title_size = 8, axis_text_size=8, axis_title_face= "bold", plot_margin = margin(4, 4, 4, 4)),  
       updatemenus = list(list(type = 'dropdown', 
                               xref = "paper", 
                               yref = "paper", 
                               xanchor = "left",
                               x = 0.04,
                               y = 0.95, 
                               buttons = list(
                                 list(method = "update",
                                      args = list(list(y = list(temp_mean_wide$Admiralty)), 
                                                    list(yaxis = list(title = "Temperature in Admiralty", range = c(0,40)))),label = "Admiralty"),
                                 list(method = "update",
                                      args = list(list(y = list(temp_mean_wide$`East Coast Parkway`)), 
                                                    list(yaxis = list(title = "Temperature in East Coast Parkway", range = c(0,40)))),label = "East Coast Parkway"), 
                                 list(method = "update",
                                      args = list(list(y = list(temp_mean_wide$`Ang Mo Kio`)), 
                                                    list(yaxis = list(title = "Temperature in Ang Mo Kio", range = c(0,40)))),label = "Ang Mo Kio"), 
                                 list(method = "update",
                                      args = list(list(y = list(temp_mean_wide$Newton)), 
                                                    list(yaxis = list(title = "Temperature in Newton", range = c(0,40)))),label = "Newton"), 
                                 list(method = "update",
                                      args = list(list(y = list(temp_mean_wide$`Tuas South`)), 
                                                    list(yaxis = list(title = "Temperature in Tuas South", range = c(0,40)))),label = "Tuas South"),
                                  list(method = "update",
                                      args = list(list(y = list(temp_mean_wide$`Pasir Panjang`)), 
                                                    list(yaxis = list(title = "Temperature in Pasir Panjang", range = c(0,40)))),label = "Pasir Panjang"), 
                                  list(method = "update",
                                      args = list(list(y = list(temp_mean_wide$`Jurong Island`)), 
                                                    list(yaxis = list(title = "Temperature in Jurong Island", range = c(0,40)))),label = "Jurong Island"), 
                                 list(method = "update",
                                      args = list(list(y = list(temp_mean_wide$`Choa Chu Kang (South)`)), 
                                                    list(yaxis = list(title = "Temperature in Choa Chu Kang (South)", range = c(0,40)))),label = "Choa Chu Kang (South)"), 
                                 list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$Changi)), 
                                                    list(yaxis = list(title = "Temperature in Changi", range = c(0,40)))),label = "Changi"),
                                  list(method = "update",
                                      args = list(list(y = list(temp_mean_wide$`Tai Seng`)), 
                                                    list(yaxis = list(title = "Temperature in Tai Seng", range = c(0,40)))),label = "Tai Seng"),
                                  list(method = "update",
                                      args = list(list(y = list(temp_mean_wide$`Jurong (West)`)), 
                                                    list(yaxis = list(title = "Temperature in Jurong West", range = c(0,40)))),label = "Jurong West"), 
                                   list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$Clementi)), 
                                                    list(yaxis = list(title = "Temperature  in Clementi", range = c(0,40)))),label = "Clementi"), 
                                   list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Sentosa Island`)), 
                                                    list(yaxis = list(title = "Temperature  in Sentosa", range = c(0,40)))),label = "Sentosa"), 
                                 list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Macritchie Reservoir`)), 
                                                    list(yaxis = list(title = "Temperature  at Macritchie Reservoir", range = c(0,40)))),label = "Macritchie Reservoir"), 
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Lower Peirce Reservoir`)), 
                                                    list(yaxis = list(title = "Temperature  at Lower Peirce Reservoir", range = c(0,40)))),label = "Lower Peirce Reservoir"),
                                 list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Lim Chu Kang`)), 
                                                    list(yaxis = list(title = "Temperature at Lim Chu Kang", range = c(0,40)))),label = "Lim Chu Kang"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Marine Parade`)), 
                                                    list(yaxis = list(title = "Temperature at Marine Parade", range = c(0,40)))),label = "Marine Parade"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Choa Chu Kang (Central)`)), 
                                                    list(yaxis = list(title = "Temperature at Choa Chu Kang (Central)", range = c(0,40)))),label = "Choa Chu Kang (Central)"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Choa Chu Kang (Central)`)), 
                                                    list(yaxis = list(title = "Temperature at Choa Chu Kang (Central)", range = c(0,40)))),label = "Choa Chu Kang (Central)"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Nicoll Highway`)), 
                                                    list(yaxis = list(title = "Temperature at Nicoll Highway", range = c(0,40)))),label = "Nicoll Highway"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Botanic Garden`)), 
                                                    list(yaxis = list(title = "Temperature at Botanic Garden", range = c(0,40)))),label = "Botanic Garden"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$Whampoa)), 
                                                    list(yaxis = list(title = "Temperature at Whampoa", range = c(0,40)))),label = "Whampoa"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Jurong Pier`)), 
                                                    list(yaxis = list(title = "Temperature at Jurong Pier", range = c(0,40)))),label = "Jurong Pier"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Ulu Pandan`)), 
                                                    list(yaxis = list(title = "Temperature at Ulu Pandan", range = c(0,40)))),label = "Ulu Pandan"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$Mandai)), 
                                                    list(yaxis = list(title = "Temperature at Mandai", range = c(0,40)))),label = "Mandai"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Bukit Panjang`)), 
                                                    list(yaxis = list(title = "Temperature at Bukit Panjang", range = c(0,40)))),label = "Bukit Panjang"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Kranji Reservoir`)), 
                                                    list(yaxis = list(title = "Temperature at Kranji Reservoir", range = c(0,40)))),label = "Kranji Reservoir"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Upper Peirce Reservoir`)), 
                                                    list(yaxis = list(title = "Temperature at Upper Peirce Reservoir", range = c(0,40)))),label = "Upper Peirce Reservoir"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Kent Ridge`)), 
                                                    list(yaxis = list(title = "Temperature at Kent Ridge", range = c(0,40)))),label = "Kent Ridge"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$Queenstown)), 
                                                    list(yaxis = list(title = "Temperature at Queenstown", range = c(0,40)))),label = "Queenstown"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Tanjong Katong`)), 
                                                    list(yaxis = list(title = "Temperature at Tanjong Katong", range = c(0,40)))),label = "Tanjong Katong"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Somerset (Road)`)), 
                                                    list(yaxis = list(title = "Temperature at Somerset (Road)", range = c(0,40)))),label = "Somerset (Road)"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Punggol`)), 
                                                    list(yaxis = list(title = "Temperature at Punggol", range = c(0,40)))),label = "Punggol"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Simei`)), 
                                                    list(yaxis = list(title = "Temperature at Simei", range = c(0,40)))),label = "Simei"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Toa Payoh`)), 
                                                    list(yaxis = list(title = "Temperature at Toa Payoh", range = c(0,40)))),label = "Toa Payoh"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Tuas`)), 
                                                    list(yaxis = list(title = "Temperature at Tuas", range = c(0,40)))),label = "Tuas"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Bukit Timah`)), 
                                                    list(yaxis = list(title = "Temperature at Bukit Timah", range = c(0,40)))),label = "Bukit Timah"),
                                list(method = "update", 
                                        args = list(list(y = list(temp_mean_wide$`Pasir Ris (Central)`)), 
                                                    list(yaxis = list(title = "Temperature at Pasir Ris (Central)", range = c(0,40)))),label = "Pasir Ris (Central)")
                               ))))  
Observations
  • It seems like there are some weather stations with no temperature data at all. We should remove them from the temp dataframe.

  • There are some weather stations (e.g. Admiralty) have temperature data only from a certain year onwards (e.g. 2009), and some stations (e.g. Changi) have temperature data as early as 1980s.

  • For those weather stations with temperature data, they also have missing values over a given time period. So we will need to decide how to handle these missing values in subsequent sections.

Let us identify the amount of missing values for each weather station using the following code chunk.

Show the code
missing_values <- temp_mean_wide %>%
  gather(key = "key", value = "val") %>%
  mutate(isna = is.na(val)) %>%
  group_by(key) %>%
  mutate(total = n()) %>%
  group_by(key, total, isna) %>%
  summarise(num.isna = n()) %>%
  mutate(pct = num.isna / total * 100)

levels <-
    (missing_values  %>% filter(isna == T) %>% arrange(desc(pct)))$key

percentage_plot <- missing_values %>%
      ggplot() +
        geom_bar(aes(x = reorder(key, desc(pct)), 
                     y = pct, fill=isna), 
                 stat = 'identity', alpha=0.8) +
      scale_x_discrete(limits = levels) +
      scale_fill_manual(name = "", 
                        values = c('steelblue', 'tomato3'), labels = c("Present", "Missing")) +
      coord_flip() +
      labs(title = "Percentage of missing values", x =
             'Variable', y = "% of missing values")

percentage_plot

The above output is consistent with what we observed when exploring the data using plotly. There are numerous stations without temperature readings throughout all years and there are certain stations with temperature readings during certain time periods.

Let us find out which stations that have no temperature readings throughout the entire time period using filter().We will filter out those weather stations that have 100% NAs.

Show the code
notempdata <- missing_values %>%
  filter(isna == TRUE & pct==100)

notempdata$key
 [1] "Botanic Garden"          "Bukit Panjang"          
 [3] "Bukit Timah"             "Choa Chu Kang (Central)"
 [5] "Jurong Pier"             "Kent Ridge"             
 [7] "Kranji Reservoir"        "Lim Chu Kang"           
 [9] "Lower Peirce Reservoir"  "Macritchie Reservoir"   
[11] "Mandai"                  "Marine Parade"          
[13] "Nicoll Highway"          "Pasir Ris (Central)"    
[15] "Punggol"                 "Queenstown"             
[17] "Simei"                   "Somerset (Road)"        
[19] "Tanjong Katong"          "Toa Payoh"              
[21] "Tuas"                    "Ulu Pandan"             
[23] "Upper Peirce Reservoir"  "Whampoa"                

From the above output, we know that these 24 weather stations have no temperature readings. We will put them into a list and create an operator to exclude them from the temp data using filter().

Show the code
stationstoremove <- c("Botanic Garden","Bukit Panjang","Bukit Timah","Choa Chu Kang (Central)","Jurong Pier","Kent Ridge", "Kranji Reservoir", "Lim Chu Kang", "Lower Peirce Reservoir", "Macritchie Reservoir","Mandai", "Marine Parade","Nicoll Highway", "Pasir Ris (Central)", "Punggol", "Queenstown","Simei", "Somerset (Road)","Tanjong Katong", "Toa Payoh", "Tuas", "Ulu Pandan", "Upper Peirce Reservoir","Whampoa")

#create a operator to exclude things 
'%!in%' <- function(x,y)!('%in%'(x,y))

#excluded stations that have no temp data at all 
temp_clean <- temp %>%
  filter(station %!in% stationstoremove)

glimpse(temp_clean)
Rows: 120,139
Columns: 5
$ station             <chr> "Admiralty", "Admiralty", "Admiralty", "Admiralty"…
$ tdate               <date> 2009-01-01, 2009-01-02, 2009-01-03, 2009-01-04, 2…
$ mean_temperature    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ maximum_temperature <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ minimum_temperature <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
Show the code
#write it into RDS for future usage, esp when building 
write_rds(temp_clean, "data/temp_clean.rds")
unique(temp_clean$station)
 [1] "Admiralty"             "East Coast Parkway"    "Ang Mo Kio"           
 [4] "Newton"                "Tuas South"            "Pasir Panjang"        
 [7] "Jurong Island"         "Choa Chu Kang (South)" "Changi"               
[10] "Tai Seng"              "Jurong (West)"         "Clementi"             
[13] "Sentosa Island"       

We will then pivot the temp_clean dataframe wider and plot the daily temperature for the remaining weather stations using plotly.

Show the code
temp_mean_widec <- temp_clean %>%
  select(tdate, station, mean_temperature) %>%
  pivot_wider(names_from = station, values_from = mean_temperature)

glimpse(temp_mean_widec)
Rows: 16,071
Columns: 14
$ tdate                   <date> 2009-01-01, 2009-01-02, 2009-01-03, 2009-01-0…
$ Admiralty               <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `East Coast Parkway`    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Ang Mo Kio`            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Newton                  <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Tuas South`            <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Pasir Panjang`         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Jurong Island`         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Choa Chu Kang (South)` <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Changi                  <dbl> 26.6, 26.4, 26.5, 26.3, 27.0, 27.4, 27.1, 27.0…
$ `Tai Seng`              <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Jurong (West)`         <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ Clementi                <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ `Sentosa Island`        <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
Show the code
plot_ly(data = temp_mean_widec, 
        x = ~tdate, 
        y = ~ Admiralty, 
        type = "scatter", 
        mode = "lines+markers") |> 
  layout(title = "Temperature observed by Weather Station", 
       xaxis = list(title = "Date"), 
       yaxis = list(title = "", range = c(0,40)), 
      theme_ipsum_rc(plot_title_size = 13, plot_title_margin=4, subtitle_size=11, subtitle_margin=4,  
                 axis_title_size = 8, axis_text_size=8, axis_title_face= "bold", plot_margin = margin(4, 4, 4, 4)),  
       updatemenus = list(list(type = 'dropdown', 
                               xref = "paper", 
                               yref = "paper", 
                               xanchor = "left",
                               x = 0.04,
                               y = 0.95, 
                               buttons = list(
                                 list(method = "update",
                                      args = list(list(y = list(temp_mean_widec$Admiralty)), 
                                                    list(yaxis = list(title = "Temperature in Admiralty", range = c(0,40)))),label = "Admiralty"),
                                 list(method = "update",
                                      args = list(list(y = list(temp_mean_widec$`East Coast Parkway`)), 
                                                    list(yaxis = list(title = "Temperature in East Coast Parkway", range = c(0,40)))),label = "East Coast Parkway"), 
                                 list(method = "update",
                                      args = list(list(y = list(temp_mean_widec$`Ang Mo Kio`)), 
                                                    list(yaxis = list(title = "Temperature in Ang Mo Kio", range = c(0,40)))),label = "Ang Mo Kio"), 
                                 list(method = "update",
                                      args = list(list(y = list(temp_mean_widec$Newton)), 
                                                    list(yaxis = list(title = "Temperature in Newton", range = c(0,40)))),label = "Newton"), 
                                 list(method = "update",
                                      args = list(list(y = list(temp_mean_widec$`Tuas South`)), 
                                                    list(yaxis = list(title = "Temperature in Tuas South", range = c(0,40)))),label = "Tuas South"),
                                  list(method = "update",
                                      args = list(list(y = list(temp_mean_widec$`Pasir Panjang`)), 
                                                    list(yaxis = list(title = "Temperature in Pasir Panjang", range = c(0,40)))),label = "Pasir Panjang"), 
                                  list(method = "update",
                                      args = list(list(y = list(temp_mean_widec$`Jurong Island`)), 
                                                    list(yaxis = list(title = "Temperature in Jurong Island", range = c(0,40)))),label = "Jurong Island"), 
                                 list(method = "update",
                                      args = list(list(y = list(temp_mean_widec$`Choa Chu Kang (South)`)), 
                                                    list(yaxis = list(title = "Temperature in Choa Chu Kang", range = c(0,40)))),label = "Choa Chu Kang"), 
                                 list(method = "update", 
                                        args = list(list(y = list(temp_mean_widec$Changi)), 
                                                    list(yaxis = list(title = "Temperature in Changi", range = c(0,40)))),label = "Changi"),
                                  list(method = "update",
                                      args = list(list(y = list(temp_mean_widec$`Tai Seng`)), 
                                                    list(yaxis = list(title = "Temperature in Tai Seng", range = c(0,40)))),label = "Tai Seng"),
                                  list(method = "update",
                                      args = list(list(y = list(temp_mean_widec$`Jurong (West)`)), 
                                                    list(yaxis = list(title = "Temperature in Jurong West", range = c(0,40)))),label = "Jurong West"), 
                                   list(method = "update", 
                                        args = list(list(y = list(temp_mean_widec$Clementi)), 
                                                    list(yaxis = list(title = "Temperature  in Clementi", range = c(0,40)))),label = "Clementi"), 
                                   list(method = "update", 
                                        args = list(list(y = list(temp_mean_widec$`Sentosa Island`)), 
                                                    list(yaxis = list(title = "Temperature  in Sentosa", range = c(0,40)))),label = "Sentosa")
                                   
                               ))))  
Observations
  • From the above interactive chart, we note that some stations have a longer time period with temperature readings (e.g. Changi). Almost all stations have some missing time gaps/ values in the data, hence we will need to do imputation for this missing values to ensure better accuracy of our temperature forecasting.

  • Also, we noted that daily temperature readings that range more than 20 years is too frequent for time series forecasting. Hence, we will aggregate the daily temperature readings to monthly temperature readings by calculating the mean in subsequent section.

3.3 Creating Time Series Object

In the previous sections, we noted that the dataframes were all tibble dataframes. For us to make use of the time series forecasting packages and their functions, we would need to convert the tibble dataframe into a time series object.

Before we create the time series object, let us first aggregate the daily temperature readings to monthly temperature readings by (1) creating the year-month column for each observation using floor_date() and specifying it to derive the year and month of each observation, and (2) aggregate the temperature readings by station and year_month then use summarise() to compute the monthly averages for mean_temperature, maximum_temperature and minimum_temperature.

Show the code
#create year-month col
temp_clean$year_month <- floor_date(temp_clean$tdate, "month")
glimpse(temp_clean)
Rows: 120,139
Columns: 6
$ station             <chr> "Admiralty", "Admiralty", "Admiralty", "Admiralty"…
$ tdate               <date> 2009-01-01, 2009-01-02, 2009-01-03, 2009-01-04, 2…
$ mean_temperature    <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ maximum_temperature <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ minimum_temperature <dbl> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA…
$ year_month          <date> 2009-01-01, 2009-01-01, 2009-01-01, 2009-01-01, 2…
Show the code
monthly_temp <- temp_clean %>%                         
  group_by(station, year_month) %>% 
  summarise(across(c(mean_temperature, maximum_temperature, minimum_temperature), mean))

glimpse(monthly_temp)
Rows: 3,947
Columns: 5
Groups: station [13]
$ station             <chr> "Admiralty", "Admiralty", "Admiralty", "Admiralty"…
$ year_month          <date> 2009-01-01, 2009-02-01, 2009-03-01, 2009-04-01, 2…
$ mean_temperature    <dbl> NA, 26.76786, NA, 28.12000, 28.48387, 28.89667, 28…
$ maximum_temperature <dbl> NA, 31.44286, NA, 32.19667, 32.59032, 32.87000, 31…
$ minimum_temperature <dbl> NA, 24.26071, NA, 25.06667, 25.09355, 25.95000, 25…
Show the code
write_rds(monthly_temp, "data/monthly_temp.rds")

With the monthly temperature of all weather stations, let us filter out one weather station (e.g. Admiralty) to create a tibble data frame adm so that we can convert it into an xts object, which is a type of time series object.

Show the code
monthly_temp <- read_rds("data/monthly_temp.rds")

#filter out Admiralty weather station 
adm <- monthly_temp %>%
  filter(station == "Admiralty")

#check the resultant dataframe
summary(adm)
   station            year_month         mean_temperature maximum_temperature
 Length:179         Min.   :2009-01-01   Min.   :25.61    Min.   :28.86      
 Class :character   1st Qu.:2012-09-16   1st Qu.:27.10    1st Qu.:31.33      
 Mode  :character   Median :2016-07-01   Median :27.74    Median :31.85      
                    Mean   :2016-06-19   Mean   :27.68    Mean   :31.83      
                    3rd Qu.:2020-03-16   3rd Qu.:28.29    3rd Qu.:32.45      
                    Max.   :2023-12-01   Max.   :29.15    Max.   :33.87      
                                         NA's   :22       NA's   :18         
 minimum_temperature
 Min.   :23.63      
 1st Qu.:24.52      
 Median :24.92      
 Mean   :24.98      
 3rd Qu.:25.46      
 Max.   :26.45      
 NA's   :18         

We will use xts() from xts package to create a time series object. The order.by parameter uses the dates from the adm dataframe. We then use the ts_regular() function to give the time series object adm_xts a regular interval by adding NA values for missing dates.

Just in case there are missing months which we did not detected, we use the na.fill() function fills in those missing dates by extending values from previous days.

Show the code
adm_xts <- xts(adm[,c("mean_temperature", "maximum_temperature", "minimum_temperature")], order.by=as.Date(adm$year_month))
adm_xts<- ts_regular(adm_xts)
adm_xts <- na.fill(adm_xts, "extend")

Let us plot out the monthly mean temperature of Admiralty weather station using ggplotly.

Show the code
p1 <- ggplot(adm_xts, aes(x = Index, y = mean_temperature)) + 
  geom_line() + theme_clean() +
  labs(title = "Monthly Mean Temperature of Admiralty Weather Station", caption = "Data from Weather.gov.sg") +
  xlab("Month-Year") +
  ylab("Temperature in degrees celsius") +
  theme_ipsum_rc()

ggplotly(p1)

From the above output, we see that there are missing temperatures for numerous time periods. As a result, the line for the above chart is not continuous.

Let us investigate further using imputeTS package’s ggplot_na_distribution, which highlights the missing values in our data. For the following example, we focus on the mean temperature of the adm time series object.

Show the code
ggplot_na_distribution(x = adm$mean_temperature,
                       x_axis_labels = adm$year_month,
                       ylab = "Temperature in degrees celsius")

We also use the imputeTS package’s statsNA to have a report on the number of missing mean temperature readings.

Show the code
statsNA(adm_xts$mean_temperature)
[1] "Length of time series:"
[1] 180
[1] "-------------------------"
[1] "Number of Missing Values:"
[1] 23
[1] "-------------------------"
[1] "Percentage of Missing Values:"
[1] "12.8%"
[1] "-------------------------"
[1] "Number of Gaps:"
[1] 12
[1] "-------------------------"
[1] "Average Gap Size:"
[1] 1.916667
[1] "-------------------------"
[1] "Stats for Bins"
[1] "  Bin 1 (45 values from 1 to 45) :      3 NAs (6.67%)"
[1] "  Bin 2 (45 values from 46 to 90) :      4 NAs (8.89%)"
[1] "  Bin 3 (45 values from 91 to 135) :      11 NAs (24.4%)"
[1] "  Bin 4 (45 values from 136 to 180) :      5 NAs (11.1%)"
[1] "-------------------------"
[1] "Longest NA gap (series of consecutive NAs)"
[1] "6 in a row"
[1] "-------------------------"
[1] "Most frequent gap size (series of consecutive NA series)"
[1] "1 NA in a row (occurring 7 times)"
[1] "-------------------------"
[1] "Gap size accounting for most NAs"
[1] "1 NA in a row (occurring 7 times, making up for overall 7 NAs)"
[1] "-------------------------"
[1] "Overview NA series"
[1] "  1 NA in a row: 7 times"
[1] "  2 NA in a row: 2 times"
[1] "  3 NA in a row: 2 times"
[1] "  6 NA in a row: 1 times"

3.3.1 Missing Value Imputation

There are several ways to impute missing data in time series objects. We need to impute missing values because some of the models cannot handle NAs in Time Series objects.

3.3.1.1 Moving Averages

As this function calculates moving averages based on the last n observations, it will generally be performing better than using mean, mode and median imputation. Moving averages work well when data has a linear trend. This function also allows us to use linear-weighted and exponentially-weighted moving averages.

Show the code
adm_imp_movingavg <- na_ma(adm_xts, weighting = "exponential") #default is exponential. Other options are "simple" and "linear". We can allow users to choose if the option they want. 

#plot chart 
#ggplot(adm_imp_movingavg, aes(x = Index, y = mean_temperature)) + 
  #geom_line()

plot_ma<- ggplot(adm_imp_movingavg, aes(x = Index, y = mean_temperature)) + 
  geom_line() + theme_clean() +
  labs(title = "Monthly Mean Temperature of Admiralty Weather Station \n(missing values imputed using moving average)") +
  xlab("Month-Year") +
  ylab("Temperature in degrees celsius") +
  theme_ipsum_rc()

ggplotly(plot_ma)

3.3.1.2 Kalman smoothing

We can also use Kalman Smoothing on ARIMA model to impute the missing values.

Show the code
adm_imp_kalman <- na_kalman(adm_xts, model = "auto.arima")

#plot chart 

ggplot(adm_imp_kalman, aes(x = Index, y = mean_temperature)) + 
  geom_line()

From the above output, we see that some of the imputed values are below 0 degrees celsius which is impossible in Singapore. As such, we will not be using this method to impute missing values for temperature readings.

Kalman Smoothing also has a “StrucTS” option. Let us try and see how it works for our temperature data.

Show the code
adm_imp_kalman_ts <- na_kalman(adm_xts, model = "StructTS")

#plot chart 

ggplot(adm_imp_kalman_ts, aes(x = Index, y = mean_temperature)) + 
  geom_line()

From the above output, it seems like using the “StrucTS” model works better than auto.arima model since the imputed results were reasonable. Again, we can also let users choose which model they want to use.

3.4 Testing if the time series is stationary

Before we model the time series forecasting model, let us test is our time series data is stationary. Stationarity signifies that the statistical properties of time series, such as mean, variance, and covariance, remain constant over time, which is the fundamental assumption for many time series modeling techniques.It simplifies the complex dynamics within the data, making it more amenable to analysis, modeling, and forecasting.

There are two tests we are use to test for stationarity: - Augmented Dickey-Fuller (ADF) Test; and - Kwiatkowski-Phillips-Schmidt-Shin (KPSS) Test

3.4.1 Augmented Dickey-Fuller Test

Null Hypothesis: Series is non-stationary, or series has a unit root. Alternative Hypothesis: Series is stationary, or series has no unit root.

If the null hypothesis fails to be rejected, this test may provide evidence that the series is non-stationary.

Show the code
adf.test(adm_imp_movingavg$mean_temperature)

    Augmented Dickey-Fuller Test

data:  adm_imp_movingavg$mean_temperature
Dickey-Fuller = -7.3405, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary

Since the p-value is 0.01, which is less than critical value of 0.05, we reject the null hypothesis. This means that the time series does not have a unit root, meaning it is stationary. It does not have a time-dependent structure.

3.4.2 Kwiatkowski-Phillips-Schmidt-Shin Test

Null Hypothesis: Series is trend stationary or series has no unit root. Alternative Hypothesis: Series is non-stationary, or series has a unit root.

Note

Note: The hypothesis is reversed in the KPSS test compared to ADF Test.

Show the code
kpss.test(adm_imp_movingavg$mean_temperature)

    KPSS Test for Level Stationarity

data:  adm_imp_movingavg$mean_temperature
KPSS Level = 0.086074, Truncation lag parameter = 4, p-value = 0.1

Since the p-value is 0.1, which is greater than the critical value of 0.05, we fail to reject the null hypothesis of the KPSS test.This means we can assume that the time series is trend stationary.

Both ADF and KPSS tests conclude that the given series is stationary. This means that we can make use of most of the time series forecasting models such as Exponential Smoothing and ARIMA.

3.5 Decomposition of Time Series Object

Time series data can exhibit a variety of patterns, and it is often helpful to split a time series into several component to help us improve our understanding of the time series and forecast accuracy.

First, let us plot the monthly mean temperature of the Admiralty weather station.

Show the code
p2 <- ggplot(adm_imp_movingavg, aes(x = Index, y = mean_temperature)) + 
  geom_line() + 
  geom_smooth(method=lm) 

ggplotly(p2)
Note

From the above output, it seems like there were fluctuations in monthly mean temperature but there was no increasing trend.

It also seems like for each year, the mean temperature would usually be the highest in May/Jun of each year as indicated by the peaks. Also, for each year, the lowest mean temperature would usually be around Dec/ Jan.

To find out if there is a seasonality, trend and cycle, we can decompose a time series object using stl()from xts package. STL is a versatile and robust method for decomposing time series. STL is an acronym for “Seasonal and Trend decomposition using Loess”, while Loess is a method for estimating nonlinear relationships. The STL method was developed by R. B. Cleveland, Cleveland, McRae, & Terpenning (1990).

In the following code chunk, we use: - stl() to decompose the time series object - ts_ts() function from the library converts an xts field to a ts object that can be used with stl().

Show the code
adm_decomposition <- stl(ts_ts(adm_imp_movingavg$mean_temperature), s.window = "periodic")

## plot out the decomposition results 
autoplot(adm_decomposition)+ 
  ggtitle("Decomposition for Monthly Mean Temperature") +
  xlab("Month-Year") + 
  theme_clean()

Show the code
summary(adm_decomposition)
 Call:
 stl(x = ts_ts(adm_imp_movingavg$mean_temperature), s.window = "periodic")

 Time.series components:
    seasonal              trend            remainder         
 Min.   :-0.9176109   Min.   :27.30177   Min.   :-0.9341622  
 1st Qu.:-0.5109855   1st Qu.:27.43056   1st Qu.:-0.2329592  
 Median : 0.1857624   Median :27.65282   Median :-0.0124042  
 Mean   : 0.0000000   Mean   :27.66756   Mean   :-0.0019506  
 3rd Qu.: 0.4489919   3rd Qu.:27.85388   3rd Qu.: 0.2320894  
 Max.   : 0.7298073   Max.   :28.19072   Max.   : 0.9402058  
 IQR:
     STL.seasonal STL.trend STL.remainder data  
     0.9600       0.4233    0.4650        1.0218
   %  94.0         41.4      45.5         100.0 

 Weights: all == 1

 Other components: List of 5
 $ win  : Named num [1:3] 1801 19 13
 $ deg  : Named int [1:3] 0 1 1
 $ jump : Named num [1:3] 181 2 2
 $ inner: int 2
 $ outer: int 0
Note
  • From the above output, we noted that there is no clear linear trend for the monthly mean temperature over the years. However, we do note that the monthly mean temperature ranges from 27.3 degrees celsius to ~28.2 degree celsius.

  • We observed seasonality in the monthly mean temperature over the years.

3.6 Building Models

First we will split the data into training and validation data.

When choosing models, it is common practice to separate the available data into two portions, training and test data, where the training data is used to estimate any parameters of a forecasting method and the test data is used to evaluate its accuracy. Because the test data is not used in determining the forecasts, it should provide a reliable indication of how well the model is likely to forecast on new data.

The size of the test set is typically about 20% of the total sample, although this value depends on how long the sample is and how far ahead you want to forecast. The test set should ideally be at least as large as the maximum forecast horizon required.

For this section, we will set the test set ~20% of the dataframe we have. However when building the Shiny dashboard, we should allow user input on the duration they want to forecast (e.g. next few months or next few years) because this would affect the size of the test dataset.

Show the code
# create samples 
trainingtemp <- ts_ts(head(adm_imp_movingavg$mean_temperature, (length(adm_imp_movingavg$mean_temperature)-24))) 
validationtemp <- ts_ts(tail(adm_imp_movingavg$mean_temperature, 24)) # we are going to predict the 

3.6.1 Benchmark models

Some forecasting methods are extremely simple and surprisingly effective. We will use the following two forecasting methods (i.e. naive and seasonal naive) as benchmarks.

3.6.1.1 Naive method

For naïve forecasts, we simply set all forecasts to be the value of the last observation.

naive_model <- naive(trainingtemp, h = length(validationtemp))
datatable(data.frame(naive_model))

Mean absolute percentage error (MAPE) is the percentage equivalent of mean absolute error (MAE). Mean absolute percentage error measures the average magnitude of error produced by a model, or how far off predictions are on average.

To measure the performance of how well the model’s forecasted values as compared to the test dataset, we use MAPE() of MLmetrics package to calculate the MAPE.

MAPE(naive_model$mean, validationtemp) * 100
[1] 2.79179

From the above output, we have a MAPE of 2.79% meaning that the average difference between the forecasted value and the actual value is 2.79%.For a simple model, this forecasting accuracy is very good! It means that other models introduced would need to have a even lower MAPE in order for us to consider them.

autoplot(naive_model) +
  ggtitle("Naive Forecasts for Monthly Mean Temperature") +
  xlab("Month-Year") + 
  ylab("Temperature (degree celsius)") + 
  theme_ipsum_rc()

3.6.1.2 Seasonal naive method

A similar method is useful for highly seasonal data. In this case, we set each forecast to be equal to the last observed value from the same season (e.g.,the same month of the previous year).

snaive_model <- snaive(trainingtemp, h = length(validationtemp))
datatable(data.frame(snaive_model))
MAPE(snaive_model$mean, validationtemp) * 100
[1] 2.06697

From the above output, we have a MAPE of 2.07% meaning that the average difference between the forecasted value and the actual value is 2.07%.For a simple model, this forecasting accuracy is even better than the naive model! It means that other models introduced would need to have a even lower MAPE in order for us to consider them.

autoplot(trainingtemp) +
  autolayer(snaive(trainingtemp, h = length(validationtemp),
                   series="Seasonal Naive", PI=FALSE)) +
  ggtitle("Seasonal Naive Forecasts for Monthly Mean Temperature") +
  xlab("Month-Year") + 
  ylab("Temperature (degree celsius)") + 
  hrbrthemes::theme_ipsum_rc()

3.6.2 Exponential Smoothing Methods

3.6.2.1 Simple exponential smoothing

The simplest of the exponentially smoothing methods is naturally called simple exponential smoothing (SES). This method is suitable for forecasting data with no clear trend or seasonal pattern.

ses_modelT <- ses(trainingtemp, h = length(validationtemp))
datatable(data.frame(ses_modelT))
MAPE(ses_modelT$mean, validationtemp) * 100
[1] 2.784014

From the above output, we have a MAPE of 2.78% meaning that the average difference between the forecasted value and the actual value is 2.78%, which is slightly better than the naive model and poorer performance than the seasonal naive model.

autoplot(ses_modelT) +
  ggtitle("Simple exponential smoothing Forecasts for \nMonthly Mean Temperature") +
  xlab("Month-Year") + 
  ylab("Temperature (degree celsius)") + 
  theme_ipsum_rc()

This output resembles the results from the naive model.

3.6.2.2 State Space Model

State space models provide a flexible framework for modeling time series data. They consist of two components: the state equation and the observation equation. The state equation describes how the underlying states of the system evolve over time, while the observation equation relates the observed data to the underlying states.

State space models allow us to capture complex dynamics and dependencies in the data, making them suitable for a wide range of applications, including finance, economics and engineering.

We use ets() from forecast package to find out the optimal model.

ets_modelT <- ets(trainingtemp)
summary(ets_modelT)
ETS(M,N,A) 

Call:
 ets(y = trainingtemp) 

  Smoothing parameters:
    alpha = 0.2716 
    gamma = 1e-04 

  Initial states:
    l = 27.8746 
    s = -0.8262 -0.4477 0.1394 0.235 0.4386 0.5042
           0.6772 0.6073 0.3135 -0.0647 -0.631 -0.9455

  sigma:  0.0158

     AIC     AICc      BIC 
544.5750 548.0035 590.3228 

Training set error measures:
                       ME      RMSE      MAE         MPE     MAPE      MASE
Training set -0.005080769 0.4168655 0.336926 -0.03791676 1.219642 0.7034614
                  ACF1
Training set 0.1037638

We see ETS (M,N,A). This means we have an ets model with multiplicative errors, no trend and a additive seasonality. Additive seasonality means there aren’t any changes to widths or heights of seasonal periods over time.

ets_forecastT <- forecast(ets_modelT, h=length(validationtemp))
datatable(data.frame(ets_forecastT))
MAPE(ets_forecastT$mean, validationtemp) *100
[1] 1.471135

From the above output, we have a MAPE of 1.47% meaning that the average difference between the forecasted value and the actual value is 1.47%, which is so far the best performing model.

autoplot(ets_forecastT) +
  ggtitle("ETS Forecasts for Monthly Mean Temperature") +
  xlab("Month-Year") + 
  ylab("Temperature (degree celsius)") + 
  theme_ipsum_rc()

3.6.2.3 Holt-Winters

Since time series analysis decomposes past weather observations into seasonal, trend, and random components, that data can be used to create forecasts based on an assumption that the observed patterns will continue into the future. This type of forecasting is especially useful in business for anticipating potential demand for seasonal products.

To forecast future temperatures based on historical observations, we can use Holt-Winters model that considers past seasonal cycles, trends, and random variation.

Note that for Holt-Winters’ method, we can choose additive or multiplicative seasonality to forecast the monthly mean temperature, which can be part of the user input.

3.6.2.3.1 Additive Seasonality
hw_modela <- hw(trainingtemp, h = length(validationtemp), seasonal = "additive") 
datatable(data.frame(hw_modela))
MAPE(hw_modela$mean, validationtemp)*100
[1] 1.467264

From the above output, we have a MAPE of 1.47% meaning that the average difference between the forecasted value and the actual value is 1.47%, which gives us as good performance as the State Space Model.

autoplot(hw_modela) +
  ggtitle("Holt-Winters (Additive Seasonality) Forecasts for \nMonthly Mean Temperature") +
  xlab("Month-Year") + 
  ylab("Temperature (degree celsius)") + 
  theme_ipsum_rc()

3.6.2.3.2 Multiplicative Seasonality
hw_modelm <- hw(trainingtemp, h = length(validationtemp), seasonal = "multiplicative") 
datatable(data.frame(hw_modelm))
MAPE(hw_modelm$mean, validationtemp)*100
[1] 1.525166

From the above output, we have a MAPE of 1.53% meaning that the average difference between the forecasted value and the actual value is 1.53%, which is relatively slightly poorer than the Holt-Winters’ Additive Seasonality Model.

autoplot(hw_modelm) +
  ggtitle("Holt-Winters (Multiplicative Seasonality) Forecasts for \nMonthly Mean Temperature") +
  xlab("Month-Year") + 
  ylab("Temperature (degree celsius)") + 
  theme_ipsum_rc()

3.6.3 ARIMA

ARIMA models provide another approach to time series forecasting. While exponential smoothing models are based on a description of the trend and seasonality in the data, auto regressive integrated moving average (ARIMA) modeling involves a more detailed analysis of the training data using lags and lagged forecast errors.

The first step is to use a function like auto.arima() to analyze the data and find appropriate model configuration parameters.

arima_optimal <- auto.arima(trainingtemp)
arima_optimal
Series: trainingtemp 
ARIMA(1,0,1)(2,1,1)[12] with drift 

Coefficients:
         ar1      ma1    sar1     sar2     sma1   drift
      0.7912  -0.4842  0.0130  -0.1503  -0.8839  0.0013
s.e.  0.1180   0.1735  0.1349   0.1298   0.1996  0.0018

sigma^2 = 0.1948:  log likelihood = -93.88
AIC=201.76   AICc=202.59   BIC=222.55

The function returned the following model: ARIMA (1,0,1)(2,1,1)[12] with drift.

arima_model <- forecast(arima_optimal)
datatable(data.frame(arima_model))
MAPE(arima_model$mean, validationtemp)*100
[1] 1.552931

From the above output, we have a MAPE of 1.55% meaning that the average difference between the forecasted value and the actual value is 1.55%, which gives us better performance than the benchmark models.

autoplot(arima_model) +
  ggtitle("ARIMA Forecasts for Monthly Mean Temperature") +
  xlab("Month-Year") + 
  ylab("Temperature (degree celsius)") + 
  theme_ipsum_rc()

3.6.4 Comparisons of Models Tried

MAPE Results of Models Tried
Model MAPE (%)
Naive 2.79
Seasonal Naive 2.07
Simple Exponential Smoothing 2.78
State Space Model 1.47
Holt-Winters’ Model (Additive Seasonality) 1.47
Holt-Winters’ Model (Multiplicative Seasonality) 1.53
ARIMA 1.55

From the above table, the state space model, Holt-Winters’ model and ARIMA model all outperformed the benchmark models (i.e. naive Model and Seasonal Naive Model) for temperature data. We can consider letting users to choose to use these models when forecasting temperature data.

4 Rainfall Data

In this section, we will be cleaning the rainfall data and building models for the rainfall data.

First we select the relevant rainfall related columns needed for this exercise using select(). We will examine the resultant dataframe rainfall using str().

Show the code
rain <- data %>%
  select(tdate, station, daily_rainfall_total) 

str(rain)
Observations
  • The resultant dataframe (rainfall) is a tibble dataframe with the following columns:
    • tdate: refers to the date of the rainfall reading is collected.
    • station: refers to the weather station that collected this rainfall reading.
    • daily_rainfall_total: refers to the total amount of rainfall observed by this weather station in a day. The unit of measurement is in mm.

Let us save the dataframe into an RDS file for future usage using write_rds().

Show the code
write_rds(rain, "data/rainfall.rds")

We will bring in the rainfall data using read_rds(). Let us check the imported RDS data using str() again.

Show the code
rain <- read_rds("data/rainfall.rds")
str(rain)
tibble [329,156 × 3] (S3: tbl_df/tbl/data.frame)
 $ tdate               : Date[1:329156], format: "1980-01-01" "1980-01-02" ...
 $ station             : chr [1:329156] "Macritchie Reservoir" "Macritchie Reservoir" "Macritchie Reservoir" "Macritchie Reservoir" ...
 $ daily_rainfall_total: num [1:329156] 0 0 0 0 22.6 49.6 2.4 0 0 0 ...
Observations
  • rainfall is a tibble dataframe.
  • The data type for the columns are in order: station is character data type, tdate is date data type and daily_rainfall_total is numeric data.

4.1 Investigating missing values

First, let us use summary() to check for missing data.

Show the code
summary(rain)
     tdate              station          daily_rainfall_total
 Min.   :1980-01-01   Length:329156      Min.   :  0.000     
 1st Qu.:1997-04-29   Class :character   1st Qu.:  0.000     
 Median :2011-09-18   Mode  :character   Median :  0.200     
 Mean   :2007-07-02                      Mean   :  6.822     
 3rd Qu.:2017-11-27                      3rd Qu.:  6.500     
 Max.   :2023-12-31                      Max.   :278.600     
 NA's   :58                              NA's   :5136        
Observations
  • The observations ranged from 1 Jan 1980 to 31 Dec 2023. There are 58 rows with missing dates. We should drop these rows since they are unable to tell us which day the observations were made (even if they have rainfall readings).

  • There are 5,136 rows of NAs for daily rainfall.

First, let us drop those rows where date is missing because we would not be able to definitively identify when the temperature(s) were collected (even if there were temperature readings for these rows.

Show the code
rain <- rain %>%
  drop_na(tdate)

summary(rain)
     tdate              station          daily_rainfall_total
 Min.   :1980-01-01   Length:329098      Min.   :  0.000     
 1st Qu.:1997-04-29   Class :character   1st Qu.:  0.000     
 Median :2011-09-18   Mode  :character   Median :  0.200     
 Mean   :2007-07-02                      Mean   :  6.822     
 3rd Qu.:2017-11-27                      3rd Qu.:  6.500     
 Max.   :2023-12-31                      Max.   :278.600     
                                         NA's   :5078        

Let us also do a check of the weather stations.

Show the code
unique(rain$station)
 [1] "Macritchie Reservoir"    "Lower Peirce Reservoir" 
 [3] "Admiralty"               "East Coast Parkway"     
 [5] "Ang Mo Kio"              "Newton"                 
 [7] "Lim Chu Kang"            "Marine Parade"          
 [9] "Choa Chu Kang (Central)" "Tuas South"             
[11] "Pasir Panjang"           "Jurong Island"          
[13] "Nicoll Highway"          "Botanic Garden"         
[15] "Choa Chu Kang (South)"   "Whampoa"                
[17] "Changi"                  "Jurong Pier"            
[19] "Ulu Pandan"              "Mandai"                 
[21] "Tai Seng"                "Jurong (West)"          
[23] "Clementi"                "Sentosa Island"         
[25] "Bukit Panjang"           "Kranji Reservoir"       
[27] "Upper Peirce Reservoir"  "Kent Ridge"             
[29] "Queenstown"              "Tanjong Katong"         
[31] "Somerset (Road)"         "Punggol"                
[33] "Simei"                   "Toa Payoh"              
[35] "Tuas"                    "Bukit Timah"            
[37] "Pasir Ris (Central)"    

4.2 Further exploration of total rainfall using plotly

From the previous section, we noted that there are many weather stations in the rainfall dataframe. Hence, we will make use of plotly to further explore the missing temperatures.

First, we will pivot the dataframe wider.

Show the code
rain_wide <- rain %>%
  pivot_wider(names_from = station, values_from = daily_rainfall_total)

summary(rain_wide)
     tdate            Macritchie Reservoir Lower Peirce Reservoir
 Min.   :1980-01-01   Min.   :  0.000      Min.   :  0.000       
 1st Qu.:1990-12-31   1st Qu.:  0.000      1st Qu.:  0.000       
 Median :2001-12-31   Median :  0.200      Median :  0.400       
 Mean   :2001-12-31   Mean   :  7.145      Mean   :  7.706       
 3rd Qu.:2012-12-30   3rd Qu.:  7.100      3rd Qu.:  8.200       
 Max.   :2023-12-31   Max.   :256.000      Max.   :227.600       
                      NA's   :121          NA's   :11141         
   Admiralty       East Coast Parkway   Ang Mo Kio          Newton       
 Min.   :  0.000   Min.   :  0.000    Min.   :  0.000   Min.   :  0.000  
 1st Qu.:  0.000   1st Qu.:  0.000    1st Qu.:  0.000   1st Qu.:  0.000  
 Median :  0.400   Median :  0.000    Median :  0.400   Median :  0.200  
 Mean   :  6.735   Mean   :  4.977    Mean   :  7.218   Mean   :  6.625  
 3rd Qu.:  6.800   3rd Qu.:  3.800    3rd Qu.:  7.800   3rd Qu.:  6.800  
 Max.   :142.000   Max.   :192.600    Max.   :164.400   Max.   :150.400  
 NA's   :10785     NA's   :10778      NA's   :10901     NA's   :11112    
  Lim Chu Kang     Marine Parade    Choa Chu Kang (Central)   Tuas South     
 Min.   :  0.000   Min.   :  0.00   Min.   :  0.00          Min.   :  0.000  
 1st Qu.:  0.000   1st Qu.:  0.00   1st Qu.:  0.00          1st Qu.:  0.000  
 Median :  0.400   Median :  0.00   Median :  0.40          Median :  0.200  
 Mean   :  6.759   Mean   :  5.61   Mean   :  7.51          Mean   :  6.892  
 3rd Qu.:  7.000   3rd Qu.:  4.95   3rd Qu.:  8.60          3rd Qu.:  6.200  
 Max.   :158.600   Max.   :195.40   Max.   :144.20          Max.   :208.200  
 NA's   :11214     NA's   :10933    NA's   :10994           NA's   :11472    
 Pasir Panjang    Jurong Island     Nicoll Highway    Botanic Garden   
 Min.   :  0.00   Min.   :  0.000   Min.   :  0.000   Min.   :  0.000  
 1st Qu.:  0.00   1st Qu.:  0.000   1st Qu.:  0.000   1st Qu.:  0.000  
 Median :  0.20   Median :  0.000   Median :  0.200   Median :  0.200  
 Mean   :  6.29   Mean   :  6.287   Mean   :  6.548   Mean   :  7.283  
 3rd Qu.:  6.00   3rd Qu.:  5.600   3rd Qu.:  6.200   3rd Qu.:  7.800  
 Max.   :151.60   Max.   :185.200   Max.   :163.800   Max.   :160.400  
 NA's   :11031    NA's   :11582     NA's   :11269     NA's   :11228    
 Choa Chu Kang (South)    Whampoa            Changi        Jurong Pier     
 Min.   :  0.000       Min.   :  0.000   Min.   :  0.00   Min.   :  0.000  
 1st Qu.:  0.000       1st Qu.:  0.000   1st Qu.:  0.00   1st Qu.:  0.000  
 Median :  0.400       Median :  0.200   Median :  0.00   Median :  0.200  
 Mean   :  7.469       Mean   :  6.888   Mean   :  5.81   Mean   :  7.131  
 3rd Qu.:  8.200       3rd Qu.:  7.100   3rd Qu.:  4.40   3rd Qu.:  7.000  
 Max.   :143.800       Max.   :154.600   Max.   :216.20   Max.   :226.400  
 NA's   :11491         NA's   :11606                      NA's   :275      
   Ulu Pandan          Mandai           Tai Seng       Jurong (West)    
 Min.   :  0.000   Min.   :  0.000   Min.   :  0.000   Min.   :  0.000  
 1st Qu.:  0.000   1st Qu.:  0.000   1st Qu.:  0.000   1st Qu.:  0.000  
 Median :  0.200   Median :  0.400   Median :  0.000   Median :  0.300  
 Mean   :  7.201   Mean   :  7.218   Mean   :  6.757   Mean   :  7.224  
 3rd Qu.:  6.900   3rd Qu.:  7.150   3rd Qu.:  5.900   3rd Qu.:  7.000  
 Max.   :230.400   Max.   :247.200   Max.   :217.200   Max.   :226.200  
 NA's   :355       NA's   :828       NA's   :5         NA's   :445      
    Clementi       Sentosa Island    Bukit Panjang     Kranji Reservoir 
 Min.   :  0.000   Min.   :  0.000   Min.   :  0.000   Min.   :  0.000  
 1st Qu.:  0.000   1st Qu.:  0.000   1st Qu.:  0.000   1st Qu.:  0.000  
 Median :  0.300   Median :  0.000   Median :  0.400   Median :  0.300  
 Mean   :  7.216   Mean   :  6.166   Mean   :  7.316   Mean   :  7.041  
 3rd Qu.:  7.200   3rd Qu.:  5.100   3rd Qu.:  7.400   3rd Qu.:  7.100  
 Max.   :239.500   Max.   :220.800   Max.   :235.600   Max.   :239.800  
 NA's   :207       NA's   :365       NA's   :227       NA's   :234      
 Upper Peirce Reservoir   Kent Ridge        Queenstown      Tanjong Katong   
 Min.   :  0.000        Min.   :  0.000   Min.   :  0.000   Min.   :  0.000  
 1st Qu.:  0.000        1st Qu.:  0.000   1st Qu.:  0.000   1st Qu.:  0.000  
 Median :  0.400        Median :  0.200   Median :  0.200   Median :  0.100  
 Mean   :  7.527        Mean   :  7.365   Mean   :  6.805   Mean   :  6.136  
 3rd Qu.:  7.900        3rd Qu.:  7.400   3rd Qu.:  6.100   3rd Qu.:  5.100  
 Max.   :202.800        Max.   :179.600   Max.   :278.600   Max.   :226.000  
 NA's   :11168          NA's   :10858     NA's   :773       NA's   :392      
 Somerset (Road)      Punggol            Simei           Toa Payoh      
 Min.   :  0.000   Min.   :  0.000   Min.   :  0.000   Min.   :  0.000  
 1st Qu.:  0.000   1st Qu.:  0.000   1st Qu.:  0.000   1st Qu.:  0.000  
 Median :  0.200   Median :  0.200   Median :  0.000   Median :  0.200  
 Mean   :  6.387   Mean   :  6.594   Mean   :  5.987   Mean   :  7.211  
 3rd Qu.:  6.200   3rd Qu.:  6.400   3rd Qu.:  5.200   3rd Qu.:  7.600  
 Max.   :155.800   Max.   :168.800   Max.   :182.600   Max.   :150.200  
 NA's   :11427     NA's   :10767     NA's   :11013     NA's   :11037    
      Tuas          Bukit Timah     Pasir Ris (Central)
 Min.   :  0.000   Min.   :  0.00   Min.   :  0.000    
 1st Qu.:  0.000   1st Qu.:  0.00   1st Qu.:  0.000    
 Median :  0.200   Median :  0.40   Median :  0.200    
 Mean   :  7.198   Mean   :  7.13   Mean   :  6.165    
 3rd Qu.:  7.600   3rd Qu.:  7.40   3rd Qu.:  5.400    
 Max.   :217.000   Max.   :156.80   Max.   :185.800    
 NA's   :10690     NA's   :10826    NA's   :11057      

We will make use of plotly to explore the daily rainfall for each station using a dropdown list.

Show the code
plot_ly(data = rain_wide, 
        x = ~tdate, 
        y = ~ Admiralty, 
        type = "scatter",
        mode = "lines") |> 
  layout(title = "Total Rain Fall observed by Weather Station", 
       xaxis = list(title = "Date", range(as.Date("1980-01-01"), as.Date("2023-12-31"))), 
       yaxis = list(title = ""), 
      theme_ipsum_rc(plot_title_size = 13, plot_title_margin=4, subtitle_size=11, subtitle_margin=4,  
                 axis_title_size = 8, axis_text_size=8, axis_title_face= "bold", plot_margin = margin(4, 4, 4, 4)),  
       updatemenus = list(list(type = 'dropdown', 
                               xref = "paper", 
                               yref = "paper", 
                               xanchor = "left",
                               x = 0.04,
                               y = 0.95, 
                               buttons = list(
                                 list(method = "update",
                                      args = list(list(y = list(rain_wide$Admiralty)), 
                                                    list(yaxis = list(title = "Total Rainfall observed in Admiralty"))),label = "Admiralty"),
                                 list(method = "update",
                                      args = list(list(y = list(rain_wide$`East Coast Parkway`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed in East Coast Parkway"))),label = "East Coast Parkway"), 
                                 list(method = "update",
                                      args = list(list(y = list(rain_wide$`Ang Mo Kio`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed in Ang Mo Kio"))),label = "Ang Mo Kio"), 
                                 list(method = "update",
                                      args = list(list(y = list(rain_wide$Newton)), 
                                                    list(yaxis = list(title = "Total Rainfall observed in Newton"))),label = "Newton"), 
                                 list(method = "update",
                                      args = list(list(y = list(rain_wide$`Tuas South`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed in Tuas South"))),label = "Tuas South"),
                                  list(method = "update",
                                      args = list(list(y = list(rain_wide$`Pasir Panjang`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed in Pasir Panjang"))),label = "Pasir Panjang"), 
                                  list(method = "update",
                                      args = list(list(y = list(rain_wide$`Jurong Island`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed in Jurong Island"))),label = "Jurong Island"), 
                                 list(method = "update",
                                      args = list(list(y = list(rain_wide$`Choa Chu Kang (South)`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed in Choa Chu Kang (South)"))),label = "Choa Chu Kang (South)"), 
                                 list(method = "update", 
                                        args = list(list(y = list(rain_wide$Changi)), 
                                                    list(yaxis = list(title = "Total Rainfall observed in Changi"))),label = "Changi"),
                                  list(method = "update",
                                      args = list(list(y = list(rain_wide$`Tai Seng`)), 
                                                  list(yaxis = list(title = "Total Rainfall observed in Tai Seng"))),label = "Tai Seng"),
                                  list(method = "update",
                                      args = list(list(y = list(rain_wide$`Jurong (West)`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed in Jurong West"))),label = "Jurong West"), 
                                   list(method = "update", 
                                        args = list(list(y = list(rain_wide$Clementi)), 
                                                    list(yaxis = list(title = "Total Rainfall observed in Clementi"))),label = "Clementi"), 
                                   list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Sentosa Island`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed in Sentosa"))),label = "Sentosa"), 
                                 list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Macritchie Reservoir`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed  at Macritchie Reservoir"))),label = "Macritchie Reservoir"), 
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Lower Peirce Reservoir`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed  at Lower Peirce Reservoir"))),label = "Lower Peirce Reservoir"),
                                 list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Lim Chu Kang`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Lim Chu Kang"))),label = "Lim Chu Kang"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Marine Parade`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Marine Parade"))),label = "Marine Parade"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Choa Chu Kang (Central)`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Choa Chu Kang (Central)"))),label = "Choa Chu Kang (Central)"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Nicoll Highway`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Nicoll Highway"))),label = "Nicoll Highway"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Botanic Garden`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Botanic Garden"))),label = "Botanic Garden"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$Whampoa)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Whampoa"))),label = "Whampoa"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Jurong Pier`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Jurong Pier"))),label = "Jurong Pier"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Ulu Pandan`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Ulu Pandan"))),label = "Ulu Pandan"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$Mandai)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Mandai"))),label = "Mandai"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Bukit Panjang`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Bukit Panjang"))),label = "Bukit Panjang"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Kranji Reservoir`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Kranji Reservoir"))),label = "Kranji Reservoir"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Upper Peirce Reservoir`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Upper Peirce Reservoir"))),label = "Upper Peirce Reservoir"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Kent Ridge`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Kent Ridge"))),label = "Kent Ridge"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$Queenstown)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Queenstown"))),label = "Queenstown"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Tanjong Katong`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Tanjong Katong"))),label = "Tanjong Katong"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Somerset (Road)`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Somerset (Road)"))),label = "Somerset (Road)"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Punggol`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Punggol"))),label = "Punggol"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Simei`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Simei"))),label = "Simei"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Toa Payoh`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Toa Payoh"))),label = "Toa Payoh"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Tuas`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Tuas"))),label = "Tuas"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Bukit Timah`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Bukit Timah"))),label = "Bukit Timah"),
                                list(method = "update", 
                                        args = list(list(y = list(rain_wide$`Pasir Ris (Central)`)), 
                                                    list(yaxis = list(title = "Total Rainfall observed at Pasir Ris (Central)"))),label = "Pasir Ris (Central)")
                               ))))  
Observations
  • It seems like there are some stations with no rainfall data in all years, while some have rainfall data from certain years onwards. Let’s explore further.

Let us find out the amount of missing values for each weather station using the following code chunk.

missing_values <- rain_wide %>%
  gather(key = "key", value = "val") %>%
  mutate(isna = is.na(val)) %>%
  group_by(key) %>%
  mutate(total = n()) %>%
  group_by(key, total, isna) %>%
  summarise(num.isna = n()) %>%
  mutate(pct = num.isna / total * 100)

levels <-
    (missing_values  %>% filter(isna == T) %>% arrange(desc(pct)))$key

percentage_plot <- missing_values %>%
      ggplot() +
        geom_bar(aes(x = reorder(key, desc(pct)), 
                     y = pct, fill=isna), 
                 stat = 'identity', alpha=0.8) +
      scale_x_discrete(limits = levels) +
      scale_fill_manual(name = "", 
                        values = c('steelblue', 'tomato3'), labels = c("Present", "Missing")) +
      coord_flip() +
      labs(title = "Percentage of missing values", x =
             'Variable', y = "% of missing values")
missing_values %>%
  filter(isna == TRUE) %>% 
  datatable()
percentage_plot

Let us check if there are any stations where they have 100% missing values.

Show the code
norfdata <- missing_values %>%
  filter(isna == TRUE & pct==100)

norfdata$key
character(0)

From the above output, it seems like no stations have 100% missing values.

Let us check if there are any stations with no missing values.

Show the code
allrfdata <- missing_values %>%
  filter(isna == FALSE & pct==100)

allrfdata$key
[1] "Changi" "tdate" 

From the above output, it seems like only Changi weather station has no missing values. This means that for other weather stations there are some amount of missing data for each weather station. Let us impute the missing values in the next section.

4.3 Creating Time Series Object

As mentioned earlier, for us to make use of the time series forecasting packages and their functions, we would need to convert the tibble dataframe into a time series object.

Before we create the time series object, let us first aggregate the daily rainfall readings to monthly rainfall readings by (1) creating the year-month column for each observation using floor_date() and specifying it to derive the year and month of each observation, and (2) aggregate the temperature readings by station and year_month then use summarise() to compute the monthly rainfall reading.

Show the code
#create year-month col
rain$year_month <- floor_date(rain$tdate, "month")
glimpse(rain)
Rows: 329,098
Columns: 4
$ tdate                <date> 1980-01-01, 1980-01-02, 1980-01-03, 1980-01-04, …
$ station              <chr> "Macritchie Reservoir", "Macritchie Reservoir", "…
$ daily_rainfall_total <dbl> 0.0, 0.0, 0.0, 0.0, 22.6, 49.6, 2.4, 0.0, 0.0, 0.…
$ year_month           <date> 1980-01-01, 1980-01-01, 1980-01-01, 1980-01-01, …
Show the code
monthly_rain <- rain %>%                         
  group_by(station, year_month) %>% 
  summarise(total_rf = sum(daily_rainfall_total))

glimpse(monthly_rain)
Rows: 10,813
Columns: 3
Groups: station [37]
$ station    <chr> "Admiralty", "Admiralty", "Admiralty", "Admiralty", "Admira…
$ year_month <date> 2009-01-01, 2009-02-01, 2009-03-01, 2009-04-01, 2009-05-01…
$ total_rf   <dbl> NA, 148.0, NA, 148.8, 205.6, 92.0, 103.0, 90.2, 67.6, 160.0…

With the monthly temperature of all weather stations, let us filter out one weather station (e.g. Admiralty) to create a tibble data frame adm_rf so that we can convert it into an xts object, which is a type of time series object.

adm_rf<- monthly_rain %>%
  filter(station == "Admiralty")

summary(adm_rf)
   station            year_month            total_rf    
 Length:179         Min.   :2009-01-01   Min.   : 15.8  
 Class :character   1st Qu.:2012-09-16   1st Qu.:124.4  
 Mode  :character   Median :2016-07-01   Median :189.0  
                    Mean   :2016-06-19   Mean   :204.9  
                    3rd Qu.:2020-03-16   3rd Qu.:270.4  
                    Max.   :2023-12-01   Max.   :517.4  
                                         NA's   :18     

We will use xts() from xts package to create a time series object. The order.by parameter uses the dates from the adm_rf dataframe. We then use the ts_regular() function to give the time series object adm_rf_xts a regular interval by adding NA values for missing dates.

Just in case there are missing months which we did not detected, we use the na.fill() function fills in those missing dates by extending values from previous days.

Show the code
adm_rf_xts <- xts(adm_rf[,"total_rf"], order.by=as.Date(adm_rf$year_month))
adm_rf_xts<- ts_regular(adm_rf_xts)
adm_rf_xts <- na.fill(adm_rf_xts, "extend")

Let us plot out the monthly rainfall of Admiralty weather station using ggplotly.

p3 <- ggplot(adm_rf_xts, aes(x = Index, y = value)) + 
  geom_line() + theme_clean() +
  labs(title = "Monthly Rainfall of Admiralty Weather Station", caption = "Data from Weather.gov.sg") +
  xlab("Month-Year") +
  ylab("Rainfall (in mm)") +
  theme_ipsum_rc()

ggplotly(p3)

From the above output, we see that there are missing temperatures for numerous time periods. As a result, the line for the above chart is not continuous.

Let us investigate futher using imputeTS package’s ggplot_na_distribution, which highlights the missing values in our data.

ggplot_na_distribution(adm_rf_xts)

4.3.1 Missing Value Imputation

There are several ways to impute missing data in time series objects.

4.3.1.1 Moving Averages

This na_ma()function also allows us to use linear-weighted and exponentially-weighted moving averages.

admrf_imp_movingavg <- na_ma(adm_rf_xts, weighting = "exponential") #default is exponential. Other options are "simple" and "linear". We can allow users to choose if the option they want. 

#plot chart 
ggplot(admrf_imp_movingavg, aes(x = Index, y = value)) + 
  geom_line()

4.3.1.2 Kalman Smoothing

We can also use Kalman Smoothing on ARIMA model to impute the missing values.

admrf_imp_kalman <- na_kalman(adm_rf_xts, model = "auto.arima")

#plot chart 

ggplot(admrf_imp_kalman, aes(x = Index, y = value)) + 
  geom_line()

Kalman Smoothing also has a “StrucTS” option. Let us try and see how it works for our monthly rainfall data.

admrf_imp_kalmans <- na_kalman(adm_rf_xts, model = "StructTS")

#plot chart 

ggplot(admrf_imp_kalmans, aes(x = Index, y = value)) + 
  geom_line()

For the subsequent sections, we will use the results from moving average to build our model.

4.4 Testing if the time series is stationary

Similar to the temperature data, we will also test for stationarity for our rainfall data.

4.4.1 Augmented Dickey-Fuller Test

  • Null Hypothesis: Series is non-stationary, or series has a unit root.
  • Alternative Hypothesis: Series is stationary, or series has no unit root.
adf.test(admrf_imp_movingavg$value)

    Augmented Dickey-Fuller Test

data:  admrf_imp_movingavg$value
Dickey-Fuller = -4.5358, Lag order = 5, p-value = 0.01
alternative hypothesis: stationary

Since the p-value is 0.01, which is less than critical value of 0.05, we reject the null hypothesis. This means that the time series does not have a unit root, meaning it is stationary. It does not have a time-dependent structure.

4.4.2 Kwiatkowski-Phillips-Schmidt-Shin Test

  • Null Hypothesis: Series is trend stationary or series has no unit root.
  • Alternative Hypothesis: Series is non-stationary, or series has a unit root.
kpss.test(admrf_imp_movingavg$value)

    KPSS Test for Level Stationarity

data:  admrf_imp_movingavg$value
KPSS Level = 0.20185, Truncation lag parameter = 4, p-value = 0.1

Since the p-value is 0.1, which is greater than the critical value of 0.05, we fail to reject the null hypothesis of the KPSS test.This means we can assume that the time series is trend stationary.

Both ADF and KPSS tests conclude that the given series is stationary. This means that we can make use of most of the time series forecasting models such as Exponential Smoothing and ARIMA.

4.5 Decomposition of Time Series Object

First, let us plot the monthly rainfall of the Admiralty weather station.

p4 <- ggplot(admrf_imp_movingavg, aes(x = Index, y = value)) + 
  geom_line() + 
  geom_smooth(method=lm) 

ggplotly(p4)

From the above output, it seems like there were fluctuations in monthly rainfall but there was no increasing trend. We also cannot see if there are any seasonality.

Let us decompose the time series object and explore further.

fit <- stl(ts_ts(admrf_imp_movingavg), s.window= "periodic")

## plot out the decomposition results 
autoplot(fit)+ 
  ggtitle("Decomposition for Monthly Rainfall") +
  xlab("Month-Year") + 
  theme_clean()

summary(fit)
 Call:
 stl(x = ts_ts(admrf_imp_movingavg), s.window = "periodic")

 Time.series components:
    seasonal             trend            remainder         
 Min.   :-57.77066   Min.   :136.8089   Min.   :-162.14714  
 1st Qu.:-26.90668   1st Qu.:172.5815   1st Qu.: -58.59127  
 Median : -4.90670   Median :199.8634   Median :  -0.22377  
 Mean   :  0.00000   Mean   :203.0772   Mean   :  -0.11451  
 3rd Qu.: 24.09846   3rd Qu.:236.0493   3rd Qu.:  51.80263  
 Max.   : 77.82946   Max.   :290.0645   Max.   : 271.94494  
 IQR:
     STL.seasonal STL.trend STL.remainder data  
      51.01        63.47    110.39        133.60
   %  38.2         47.5      82.6         100.0 

 Weights: all == 1

 Other components: List of 5
 $ win  : Named num [1:3] 1801 19 13
 $ deg  : Named int [1:3] 0 1 1
 $ jump : Named num [1:3] 181 2 2
 $ inner: int 2
 $ outer: int 0
Note
  • From the above output, we noted that there is no clear linear trend for the monthly rainfall over the years. However, we do note that the inital rainfall in the earlier time period were around 150mm and the later time period has a rainfall level of 200mm.

  • We also observed seasonality in the monthly rainfall over the years.

4.6 Building Models

First we will split the data into training and validation data. Again, similar to temperature data, we will let users decide how much training data and test data they would want, and the duration of the forecast.

Show the code
# create samples 
trainingrf <- ts_ts(head(admrf_imp_movingavg$value, (length(admrf_imp_movingavg$value)-24)))
validationrf <- ts_ts(tail(admrf_imp_movingavg$value, 24)) # the number of months would be the user input  

4.6.1 Benchmark models

4.6.2 Naive method

naive_model <- naive(trainingrf, h = length(validationrf))
datatable(data.frame(naive_model))
MAPE(naive_model$mean, validationrf) * 100
[1] 39.95809

From the above output, we have a MAPE of ~40% meaning that the average difference between the forecasted value and the actual value is ~40%. As compared to the temperature data, this MAPE is considered very high and there are a lot of room for improvement.

autoplot(naive_model) +
  ggtitle("Naive Forecasts for Monthly Total Rainfall") +
  xlab("Month-Year") + 
  ylab("Amount of Rainfall (in mm)") + 
  theme_ipsum_rc()

4.6.3 Seasonal Naive method

snaive_model <- snaive(trainingrf, h = length(validationrf))
datatable(data.frame(snaive_model))
MAPE(snaive_model$mean, validationrf) * 100
[1] 69.78003

The MAPE for seasonal naive model is worse than the naive model with it have a MAPE of ~70%. Let’s try other models and see if there any improvements.

autoplot(snaive_model) +
  ggtitle("Seasonal Naive Forecasts for Monthly Total Rainfall") +
  xlab("Month-Year") + 
  ylab("Amount of Rainfall (in mm)") + 
  theme_ipsum_rc()

4.6.4 Exponential Smoothing Methods

4.6.4.1 Simple exponential smoothing

ses_modelrf <- ses(trainingrf, h = length(validationrf))
datatable(data.frame(ses_modelrf))
MAPE(ses_modelrf$mean, validationrf) * 100
[1] 42.794

From the above output, we have a MAPE of ~43% meaning that the average difference between the forecasted value and the actual value is ~43%. This MAPE is similar to the naive model.

autoplot(ses_modelrf) +
  ggtitle("Simple Expoential Smoothing forecasts for \nMonthly Total Rainfall") +
  xlab("Month-Year") + 
  ylab("Amount of Rainfall (in mm)") + 
  theme_ipsum_rc()

4.6.4.2 State Space Models

ets_modelrf <- ets(trainingrf)
summary(ets_modelrf)
ETS(M,N,A) 

Call:
 ets(y = trainingrf) 

  Smoothing parameters:
    alpha = 0.2289 
    gamma = 3e-04 

  Initial states:
    l = 211.8336 
    s = 28.6946 69.9082 11.4358 -25.3137 -47.8125 -12.06
           -50.4517 33.2959 61.8826 6.3161 -69.7022 -6.1932

  sigma:  0.4492

     AIC     AICc      BIC 
2185.839 2189.267 2231.587 

Training set error measures:
                      ME     RMSE      MAE       MPE     MAPE      MASE
Training set -0.02588395 82.13912 65.51488 -21.92718 45.45668 0.6772009
                   ACF1
Training set 0.03461945

We see ETS (M,N,A). This means we have an ets model with multiplicative errors, no trend and a additive seasonality. Additive seasonality means there aren’t any changes to widths or heights of seasonal periods over time.

ets_forecastrf <- forecast(ets_modelrf, h=length(validationrf))
datatable(data.frame(ets_forecastrf))
MAPE(ets_forecastrf$mean, validationrf)*100
[1] 45.46367

From the above output, we have a MAPE of 45.5%, which is worse than the naive model, our benchmark model.

autoplot(ets_forecastrf) +
  ggtitle("ETS Forecasts for Monthly Total Rainfall") +
  xlab("Month-Year") + 
  ylab("Amount of Rainfall (in mm)") + 
  theme_ipsum_rc()

4.6.5 Holt-Winters

4.6.5.1 Holt-Winters’ Additive Seasonality

hw_modela <- hw(trainingrf, h = length(validationrf), seasonal = "additive")
datatable(data.frame(hw_modela))
MAPE(hw_modela$mean, validationrf)*100
[1] 48.5378
autoplot(hw_modela) +
  ggtitle("Holt-Winters (Additive Seasonality) Forecasts for\nMonthly Total Rainfall") +
  xlab("Month-Year") + 
  ylab("Amount of Rainfall (in mm)") + 
  theme_ipsum_rc()

4.6.5.2 Holt-Winters’ Multiplicative Seasonality

hw_modelm <- hw(trainingrf, h = length(validationrf), seasonal = "multiplicative")
datatable(data.frame(hw_modelm))
MAPE(hw_modelm$mean, validationrf)*100
[1] 63.30767
autoplot(hw_modelm) +
  ggtitle("Holt-Winters (Multiplicative Seasonality) Forecasts for\nMonthly Total Rainfall") +
  xlab("Month-Year") + 
  ylab("Amount of Rainfall (in mm)") + 
  theme_ipsum_rc()

4.6.6 ARIMA

arima_optimal <- auto.arima(trainingrf)
arima_optimal
Series: trainingrf 
ARIMA(0,0,2) with non-zero mean 

Coefficients:
         ma1     ma2      mean
      0.2415  0.1251  196.2097
s.e.  0.0809  0.0825    9.6921

sigma^2 = 8037:  log likelihood = -921.24
AIC=1850.48   AICc=1850.74   BIC=1862.68
arima_model <- forecast(arima_optimal)
datatable(data.frame(arima_model))
MAPE(arima_model$mean, validationrf)*100
[1] 40.18849
autoplot(arima_model) +
  ggtitle("ARIMA Forecasts for Monthly Mean Temperature") +
  xlab("Month-Year") + 
  ylab("Temperature (degree celsius)") + 
  theme_ipsum_rc()

4.6.7 Comparison of Model Performance

Comparison of MAPE across Models
Model MAPE (%)
Naive 39.96%
Seasonal Naive 69.78%
Simple Exponential Smoothing 42.79%
Holt-Winters’ Additive Seasonality 48.54%
Holt-Winters’ Multiplicative Seasonality 63.31%
ARIMA 40.20%

From the above table, we see that except for ARIMA model which has a MAPE of 40%, none of the models outperformed the naive model when it comes to forecasting of rainfall data. As such, we can consider exposing only Naive and ARIMA models for rainfall data prediction on our Shiny Application.

5 Shiny Dashboard Prototype

After working on the previous sections, we can see that not every models work for temperature and rainfall data. Also, we have to impute missing data before we can build such time series forceasting models. As such, the following sections summarise the parameters and output to expose on Shiny application and how we can go about exposing these parameters and output.

5.1 Parameters to expose on Shiny Application

  1. Type of data to forecast:

    • Temperature or Rainfall
  2. Length of forecast:

    • E.g. next 3 months or next 4 years (i.e. 48 months)
  3. Which weather station

    • Our data have many data stations, so users can choose which weather station they want
  4. Method of imputing missing data

    • Choose between Moving Average, Kalman Smoothing on ARIMA or Kalman Smoothing on StrucTS.
  5. Forecasting model to use

    • For temperature data, we can consider allow users to choose from Holt-Winters, State Space and ARIMA models since we got very good data.

    • For rainfall data, if we choose to allow users to do forecasting on our Shiny application, we can consider to allow users to choose from Naive or ARIMA model.

5.2 Output to expose on Shiny Application

  • Data table and plot of the output for missing data imputation chosen

  • Data table and plot of the output for the forecasting model chosen

  • MAPE of the forecasting model chosen

5.3 How to expose these parameters and output

Parameter / Output How to expose it
Type of data to forecast Each data (i.e. temperature or rainfall) is in a different page. Hence, user can choose which data to forecast by clicking on the respective page on the Navbar Pages.
Weather Station A dropdown list to allow user to select which weather station they want to use for forecasting
Temperature / rainfall data of chosen weather station After selecting the type of data to forecast and the weather station, we can display an interactive plot of the past data and show if there are any missing data. This can help the user choose which imputation method to use for the next step.
Method of Imputing Missing Data

A dropdown list to allow user to select which missing data imputation method they want to use. The options are:

  • Moving average (exponential)

  • Moving average (linear)

  • Moving average (simple)

  • Kalman Smoothing (ARIMA)

  • Kalman Smoothing (StrucTS)

Data table and plot of the output for the missing data After selecting the type of missing data imputation method, we can display an interactive plot and data table of the resultant dataframe. This can help them decide which imputation method is most suitable. For example in the earlier section, we realised that Kalman Smoothing (ARIMA) was not suitable for Admiralty weather station’s monthly temperature data after plotting the chart using the resultant dataframe.
Decomposition of Time Series Data To help the users choose the forecasting model, we can also display the output of the time series data’s decomposition
Forecasting Model to use

A dropdown list to allow user to select which forecasting method they want to use. For temperature data, the forecasting methods are:

  • State Space Model

  • Holt Winters’ Additive Seasonality Model

  • Holt Winters’ Multiplicative Model

  • ARIMA

For rainfall data, the forecasting methods are:

  • Naive model

  • ARIMA

Length of forecast A numeric input field to allow user key in the number of months to forecast
Data table, MAPE and plot of the forecasted results After the user has chosen the forecasting method and indicated the length of forecast, we should display the data table of forecasted results, the MAPE and plot ouf the forecasted results.

5.4 Layout on Shiny Application

Based on the above considerations, we plan to have the following layout for the forecasting module in our Shiny Application. Below is an example for the forecasting of temperature data.

Layout of Forecasting Module

Another page with similar in layout would also be created for the forecasting of rainfall data. The sidebar would allow users to make their selections and the mainpanel will display the various plots and data table using tabsets. The MAE, RMSE and MAPE results for the chosen forecasting model will also be displayed at the bottom of the page.

6 References

  • www.weather.gov.sg

  • https://michaelminn.net/tutorials/r-weather/index.html https://towardsdatascience.com/a-guide-to-forecasting-in-r-6b0c9638c261

  • https://www.singstat.gov.sg/-/media/files/publications/reference/ssnsep05-pg11-14.ashx

  • https://otexts.com/fpp2/estimation-and-model-selection.html

  • https://www.michaelplazzer.com/weather-forecasting-with-machine-learning-in-r/

  • https://jtr13.github.io/EDAVold/missingTS.html

  • https://cran.r-project.org/web/packages/imputeTS/vignettes/imputeTS-Time-Series-Missing-Value-Imputation-in-R.pdf

  • https://www.analyticsvidhya.com/blog/2021/06/statistical-tests-to-check-stationarity-in-time-series-part-1/

  • https://hex.tech/blog/stationarity-in-time-series/

  • https://pkg.robjhyndman.com/forecast/reference/auto.arima.html